{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from mpl_toolkits import mplot3d\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "import scipy.cluster.hierarchy as shc\n", "\n", "# The sklearn.datasets.samples_generator module was removed in newer scikit-learn releases;\n", "# the generators are imported from sklearn.datasets directly.\n", "from sklearn.datasets import make_blobs\n", "from sklearn.datasets import make_circles\n", "from sklearn.datasets import make_moons\n", "\n", "from sklearn.cluster import AgglomerativeClustering\n", "from sklearn.cluster import KMeans\n", "\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import silhouette_score\n", "from sklearn.metrics import silhouette_samples\n", "\n", "from sklearn.decomposition import PCA\n", "\n", "from sklearn import datasets\n", "\n", "%matplotlib inline\n", "pd.set_option(\"display.max_columns\", None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 24 - Simulated clusters\n", "\n", "The following code creates 3 clusters in 3-dimensional space using 100 data points. The coordinates of the data points are stored in `X`, and the cluster each point belongs to is stored in `y`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X, y = make_blobs(n_samples=100, centers=3, n_features=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display X." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display y." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualize the clusters in 2 dimensions using PCA. First, create a PCA object and compute the new X coordinates in 2 dimensions."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new dataframe containing the new X coordinates and a column with the cluster number." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use a scatter plot to visualize the clusters in 2 dimensions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run k-means clustering to predict the clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the predicted cluster in the dataframe you created." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the confusion matrix between the actual and predicted values. How accurate was k-means? Note that k-means numbers its clusters arbitrarily, so the predicted labels may be a permutation of the actual ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens to the above analysis as you increase the number of features?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens to the above analysis if you use 3 features, but increase the number of clusters?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if you increase both the number of features and the number of clusters?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function also simulates data. What kind of data is it? 
Hint: try looking at the data and plotting it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X, y = make_moons(noise=0.05)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run k-means clustering to predict the clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How accurate is k-means clustering on this dataset?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens to the accuracy if you increase the noise parameter?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does hierarchical clustering perform on the above datasets?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a `make_circles()` function. What does it do? The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }